First, we import the data and take a look at what it actually contains.
df <- read.csv(file = 'lyon_housing.csv')
head(df,10)
| date_transaction | type_purchase | type_property | rooms_count | surface_housing | surface_effective_usable | surface_terrain | parkings_count | price | address | district | latitude | longitude | date_construction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19/10/31 | ancien | maison | 5 | 100 | NA | 247 | 0 | 530000 | 6 PAS DES ANTONINS | Villeurbanne | 45.78167 | 4.879333 | 03/06/11 11:38 |
| 18/11/26 | ancien | maison | 2 | 52 | NA | 156 | 0 | 328550 | 12 RUE DU LUIZET | Villeurbanne | 45.78324 | 4.884683 | 03/06/11 11:38 |
| 16/08/04 | ancien | appartement | 1 | 28 | 28.20 | 0 | 1 | 42500 | 4 RUE DE L ESPOIR | Villeurbanne | 45.78149 | 4.883474 | 03/06/11 11:38 |
| 16/11/18 | ancien | appartement | 3 | 67 | 66.30 | 0 | 1 | 180900 | 6 RUE DE L ESPOIR | Villeurbanne | 45.78149 | 4.883474 | 03/06/11 11:38 |
| 16/12/16 | ancien | appartement | 1 | 28 | NA | 0 | 1 | 97000 | 163 AV ROGER SALENGRO | Villeurbanne | 45.78149 | 4.883474 | 03/06/11 11:38 |
| 17/01/06 | ancien | appartement | 2 | 42 | NA | 0 | 1 | 112000 | 163 AV ROGER SALENGRO | Villeurbanne | 45.78149 | 4.883474 | 03/06/11 11:38 |
| 17/02/01 | ancien | appartement | 1 | 28 | NA | 0 | 1 | 93000 | 4 RUE DE L ESPOIR | Villeurbanne | 45.78149 | 4.883474 | 03/06/11 11:38 |
| 17/02/27 | ancien | appartement | 4 | 84 | 83.60 | 0 | 1 | 152000 | 34 RUE DU LUIZET | Villeurbanne | 45.78149 | 4.883474 | 03/06/11 11:38 |
| 17/02/22 | ancien | appartement | 1 | 28 | 28.30 | 0 | 1 | 83000 | 4 RUE DE L ESPOIR | Villeurbanne | 45.78149 | 4.883474 | 03/06/11 11:38 |
| 17/03/07 | ancien | appartement | 2 | 42 | 42.01 | 0 | 1 | 115400 | 163 AV ROGER SALENGRO | Villeurbanne | 45.78149 | 4.883474 | 03/06/11 11:38 |
To find out how much of the data is usable, we first count how many cells are empty or NA. These are what we’ll refer to as missing values later.
print(apply(is.na(df),2,sum))
## date_transaction type_purchase type_property
## 0 0 0
## rooms_count surface_housing surface_effective_usable
## 0 0 14491
## surface_terrain parkings_count price
## 0 0 0
## address district latitude
## 0 0 143
## longitude date_construction
## 143 0
As we can see, three columns contain missing values:
surface_effective_usable = 14491
latitude = 143
longitude = 143
Since the two counts are identical, I wonder whether the missing latitudes and longitudes occur in exactly the same rows.
lat_miss = which(is.na(df$latitude))
long_miss = which(is.na(df$longitude))
all(lat_miss == long_miss)
## [1] TRUE
Thus, 143 rows are missing both coordinates, and every other row has both latitude and longitude.
Now I want to draw a map to see what area the dataset actually covers, so I first remove the rows with missing coordinates and then draw the map. I’ll also add a color gradient on the price column to see whether prices are significantly higher in some areas.
#remove rows with missing coordinates (na.omit(df) would drop every row containing any NA)
#df_noNa <- na.omit(df)
df <- df %>% drop_na(longitude, latitude)
#first draw this to check if the map has proper zoom
sp = ggplot(df, aes(x = longitude, y = latitude, colour = price , shape =type_property)) + geom_point()
show(sp)
#set borders of the map
# border <- c(
# left = min(df$longitude)-(min(df$longitude)*0.1),
# bottom = min(df$latitude)-( min(df$latitude)*0.1),
# right = max(df$longitude)+(max(df$longitude)*0.1),
# top = max(df$latitude)+(max(df$latitude)*0.1)
# )
border <- c(
left = min(df$longitude),
bottom = min(df$latitude),
right = max(df$longitude),
top = max(df$latitude)
)
#get map image
map <- get_stamenmap(
bbox = border,
zoom = 10
)
## Source : http://tile.stamen.com/terrain/10/525/365.png
ggmap(map)+
geom_point(
data = df,
mapping = aes(x = longitude, y = latitude, color = price, shape = type_property)
, size = 1.5
)+ scale_color_gradientn(colours=c("darkorchid1","red"))+scale_shape_manual(values = c(4,15))
Most of the reddish points appear to be squares (houses), so later we’ll check whether houses are generally more expensive than apartments.
The area of a house affects its price, so to compare prices it is useful to add a column for price per \(m^2\).
price = df[['price']]
area = df[['surface_housing']]
ppm = price / area
df['price_per_meter'] <- ppm
For further analysis we need to know what are the values of each column.
for (column in colnames(df) ){
print(as.Set(df[column]))
}
## $date_transaction
## {16/07/01, 16/07/04,...,21/06/29, 21/06/30}
##
## $type_purchase
## {ancien, VEFA}
##
## $type_property
## {appartement, maison}
##
## $rooms_count
## {1, 2,...,5, 6}
##
## $surface_housing
## {100, 101,...,98, 99}
##
## $surface_effective_usable
## {100, 100.03,...,99.91, NA}
##
## $surface_terrain
## {0, 102,...,96, 98}
##
## $parkings_count
## {0, 1, 2, 3}
##
## $price
## {100001, 1001000,...,99995, 9e+05}
##
## $address
## {, 1 ALL ATHENA,...,99 RUE PIERRE CORNEILLE, 99 RUE TETE D'OR}
##
## $district
## {Lyon 1er Arrondissement, Lyon 2e Arrondissement,...,Lyon 9e Arrondissement, Villeurbanne}
##
## $latitude
## {45.722048, 45.722102,...,45.805026, 45.805458}
##
## $longitude
## {4.773162, 4.773826,...,4.919765, 4.920161}
##
## $date_construction
## {00/01/08 11:38, 00/01/13 11:38,...,99/12/01 11:38, 99/12/17 11:38}
##
## $price_per_meter
## {1000, 10000,...,9878.0487804878, 9913.4345}
We have two date columns, so we reformat their values for later calculations. Since the construction hour carries no information, we drop it. In fact, the time component seems to take only two values; let’s check:
modified = strptime(df[['date_construction']], "%Y/%m/%d %H:%M")
time = format(modified, format = "%H:%M:%S")
print(as.Set(time))
## {00:00:00, 11:38:00}
#DO NOT run these lines more than once
df$date_construction = strptime(df[['date_construction']], "%Y/%m/%d")
df$date_construction = as.Date(df$date_construction, format= "%Y-%m-%d")
df$date_transaction = strptime(df[['date_transaction']], "%Y/%m/%d")
df$date_transaction = as.Date(df$date_transaction, format= "%Y-%m-%d")
Also, if the construction date is in the future and the purchase type isn’t a sale before completion (VEFA), the row is corrupted and should be removed. However, since we know this column was generated by adding 2 years to the date the house’s plan was produced, we don’t shift it back and just leave it as is: a constant 2-year offset has no serious effect on our analysis.
now =as.Date(Sys.Date(), format= "%Y-%m-%d")
year(now)<-year(now)-2000
#these are the rows with inconsistent dates
future = subset(df, date_construction > now & date_construction < "0090-01-01" & type_purchase != "VEFA" )
#to remove them instead, keep the rows that do NOT match the condition:
#df <- df[!(df$date_construction > now & df$date_construction < "0090-01-01" & df$type_purchase != "VEFA"),]
For better comprehension of price range and property type, let’s draw two charts:
barplot(table(df$type_property),col="purple",horiz = T)
# plot the histogram
hist(df$price, main = paste('Prices Distribution'), xlab = 'Price', ylab = 'Count', col="purple",probability = T )
# overlay kernel density estimates on the histogram
lines(density(df$price), col="magenta", lwd=3) # add a density estimate with defaults
lines(density(df$price, adjust=2), lty="dashed", col="darkblue", lwd=3)
# Adding a legend
legend("topright", c("smoothed density (adjust = 2)", "density" ),
lty = c(2, 1), col = c("darkblue","magenta"), box.lty = 4, lwd = 3)
Earlier we saw correlations among surface, effective surface and room count. Instead of computing correlations pair by pair, we use the code below to get them for all columns at once. I remove the geographical coordinates, since they don’t provide much information when treated as plain numbers. I used use = “pairwise.complete.obs” because we have a fair amount of missing data; this way each correlation is computed from the rows that are complete for that particular pair of columns.
num_cols =unlist(lapply(df, is.numeric))
cor_df = df[,num_cols]
cor_df = cor_df[,-which(names(cor_df) %in% c("latitude","longitude"))]
res <-cor(cor_df, use = "pairwise.complete.obs")
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
As we can see, effective surface and surface are highly correlated.
Before plotting the rest of the columns, we remove missing values and drop columns that don’t provide fresh information.
#?cor
df_noNa =na.omit(df[,c('surface_housing','surface_effective_usable')])
cor_effective_surface = cor(df_noNa$surface_housing, df_noNa$surface_effective_usable)
cor_room = cor(df$surface_housing, df$rooms_count)
print(paste("correlation of surface and effective surface = ",round(cor_effective_surface,4)))
## [1] "correlation of surface and effective surface = 0.9625"
print(paste("correlation of surface and rooms count = ",round(cor_room,4)))
## [1] "correlation of surface and rooms count =  0.8384"
Both room count and effective surface are related to housing surface, but effective surface has the larger correlation; the two essentially represent the same thing, so we remove the surface_effective_usable column. It already contains 14491 missing values, and since surface_housing is rounded down to an integer, comparing the two is of little use: surface_effective_usable is sometimes larger than surface_housing by a fraction, simply because the fractional part of surface_housing was truncated.
df = df[,-which(names(df) %in% c("surface_effective_usable"))]
So lets draw correlation matrix once again and take a look at relation between columns:
#separate numeric columns
my_data <- df[, c(4:8)]
chart.Correlation(my_data, histogram=TRUE, pch=19)
num_cols =unlist(lapply(df, is.numeric))
cor_df = df[,num_cols]
cor_df = cor_df[,-which(names(cor_df) %in% c("latitude","longitude"))]
res <-cor(cor_df, use = "pairwise.complete.obs")
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
Now we draw diagrams for the rest of the columns
hist(df$parkings_count, xlab = 'Number of Parking spots', col = 'darkorchid1', main = 'Parking Count Histogram')
hist.1 = hist(df$rooms_count, main = 'Number of Rooms', xlab = 'number', col = 'magenta3')
#hist.trimmed = TrimHistogram(hist.1)
hist(df$surface_housing, main = 'Surface Housing', xlab = 'Area', col = 'darkorchid3')
# lines(density(df$surface_housing), col="magenta", lwd=3) # add a density estimate with defaults
# lines(density(df$surface_housing, adjust=2), lty="dashed", col="darkblue", lwd=3)
#
barplot(table(df$type_purchase), col = "mediumorchid")
barplot(table(df$district), col = "mediumorchid3")
Now let’s look at the distribution of room counts:
barplot(table(df$rooms_count), col = "mediumorchid4", main = 'room count barplot')
This chart is good for spotting outliers and ranges, but it shows nothing about the relation between the number of rooms and area; we’ll discuss that in later parts.
For a given continuous variable, outliers are those observations that lie more than 1.5 × IQR below the 25th percentile or above the 75th percentile, where IQR, the ‘interquartile range’, is the difference between the 75th and 25th percentiles. Look at the points beyond the whiskers in the box plots below.
outlier_values <- boxplot.stats(df$price)$out # outlier values.
boxplot(df$price, main="Price")
boxplot(df$surface_housing, main="Surface Housing")
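The 1.5 × IQR fences described above can also be computed directly. A minimal sketch, using a few prices from the head of the table as toy data standing in for `df$price`:

```r
# Compute the 1.5*IQR outlier fences by hand (toy prices standing in for df$price)
prices <- c(42500, 83000, 93000, 97000, 112000, 115400, 152000, 180900, 328550, 530000)
q     <- quantile(prices, c(0.25, 0.75))
iqr   <- q[2] - q[1]                 # same value as IQR(prices)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- prices[prices < lower | prices > upper]
outliers   # here these are the same points that boxplot.stats() flags
```

Note that `boxplot.stats()` uses hinges rather than `quantile()` percentiles, so the fences can differ slightly on small samples, though they agree here.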
We can also detect outliers for different ranges:
For prices, we analyze it based on two values: type of property and area
boxplot(price ~ surface_housing, data=df, main="Boxplot for price vs surface", col = 'darkorchid1')
boxplot(price ~ cut(surface_housing, pretty(df$surface_housing)), data=df, main="Boxplot for price (categorial) vs surface", cex.axis=0.5, col = "magenta4",xlab = 'Area')
boxplot(price ~ type_property, data = df, col = 'mediumorchid1'
)
We can do the same for rooms count
boxplot(rooms_count ~ surface_housing, data=df, col = 'purple',main="Boxplot for rooms count vs surface")
boxplot(rooms_count ~ cut(surface_housing, pretty(df$surface_housing)), xlab = 'Area',data=df, col = 'purple', main="Boxplot for rooms count (categorial) vs surface", cex.axis=0.5)
This is where the plots get interesting! Except for a few outliers, the number of rooms increases almost linearly with surface; after a point, however, the number of rooms stays constant while surface keeps increasing. It seems that beyond a certain size, people prefer larger rooms to more rooms.
Next, we import the .json file of station coordinates and load it into a data frame. Then we map the stations and find the nearest station to each house. We’ll add a new column for this value, which will help us in part 3.
The station file contains station names, latitude and longitude for each.
json_file <- fromJSON('station_coordinates.json')
#unlist the json file
json_file <- lapply(json_file, function(x) {
x[sapply(x, is.null)] <- NA
unlist(x)
})
stations = c()
lat = c()
long = c()
#iterate through json list and store data in vectors
for (i in (1:length(json_file))){
atomic_vec = json_file[[i]]
n=length(atomic_vec)/3
for (j in (1:n)){
stations = c(stations, atomic_vec[[j]])
}
for (j in ((n+1):(2*n))){
lat = c(lat, atomic_vec[[j]])
}
for (j in ((2*n+1):(3*n))){
long = c(long, atomic_vec[[j]])
}
}
station_df <- data.frame(stations, lat, long)
head(station_df)
| stations | lat | long |
|---|---|---|
| Ampère - Victor Hugo | 45.7530516 | 4.8291603 |
| Cordeliers | 45.7633942 | 4.8358228 |
| Cusset | 45.7656434 | 4.9008478 |
| Flachet - Alain Gilles | 45.7677765 | 4.8893766 |
| Foch | 45.7687864 | 4.8443036 |
| Gratte-Ciel | 45.7690539 | 4.8823107 |
To calculate distances, we use spherical geometry to derive distance from the geographic coordinates. (I studied for the Astronomy Olympiad :) )
A typical spherical triangle
For side \(c\) in the picture above, the spherical law of cosines gives:
\[\cos(c) = \cos(a)\cos(b)+ \sin(a)\sin(b)\cos(C) \] To apply this formula to our problem, place \(u\) at the North Pole:
A typical spherical triangle
Thus the formula becomes:
\[\cos(\theta) =\sin(\phi_a)\sin(\phi_b)+ \cos(\phi_a)\cos(\phi_b)\cos(\Delta \lambda) \] where \(\phi\) is latitude and \(\lambda\) is longitude. To get the distance in meters:
\[d = \theta^{rad}\times R_{\oplus} \] where \(R_{\oplus} = 6378000\) m is the radius of the Earth.
#convert degrees to radians
to_rad <- function(x){
return (x*pi/180)
}
measure_distance <- function(phi1,phi2,l1,l2) {
phi1 = to_rad(phi1)
phi2 = to_rad(phi2)
l1 = to_rad(l1)
l2 = to_rad(l2)
delta = abs(l1-l2)
theta = sin(phi1)*sin(phi2)+cos(phi1)*cos(phi2)*cos(delta)
#clamp to [-1, 1] so floating-point error cannot push acos out of its domain
d = acos(pmin(pmax(theta, -1), 1))*6378000
return (d)
}
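A quick, self-contained sanity check of `measure_distance`: the Ampère - Victor Hugo and Cordeliers stations from the table above are roughly 1.3 km apart. (The clamp on `acos` is a defensive addition, so identical points return exactly 0 instead of NaN.)

```r
# Self-contained check of the spherical-law-of-cosines distance
to_rad <- function(x) x * pi / 180
measure_distance <- function(phi1, phi2, l1, l2) {
  phi1 <- to_rad(phi1); phi2 <- to_rad(phi2)
  delta <- to_rad(abs(l1 - l2))
  theta <- sin(phi1) * sin(phi2) + cos(phi1) * cos(phi2) * cos(delta)
  # clamp guards acos against floating-point error at (near-)identical points
  acos(pmin(pmax(theta, -1), 1)) * 6378000
}
# Ampère - Victor Hugo vs Cordeliers (coordinates from station_df above)
d <- measure_distance(45.7530516, 45.7633942, 4.8291603, 4.8358228)
round(d)                                      # roughly 1.26 km
measure_distance(45.75, 45.75, 4.83, 4.83)    # identical points -> 0
```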
min_distance <- function(x, y) {
all_d = c()
for (i in (1:nrow(station_df))){
all_d = c(all_d, measure_distance(x,as.numeric(station_df$lat[i]),y,as.numeric(station_df$long[i])))
}
return(min(all_d))
}
closest <- function(x, y) {
all_d = c()
for (i in (1:nrow(station_df))){
all_d = c(all_d, measure_distance(x,as.numeric(station_df$lat[i]),y,as.numeric(station_df$long[i])))
}
i <- which.min(all_d)
return(station_df$stations[i])
}
Now we add two columns: the distance to the closest station, and the name of that station.
new_col = c()
for (i in 1 : nrow(df)){
new_col = c(new_col, min_distance(as.numeric(df[['latitude']][i]), as.numeric(df[['longitude']][i])))
}
df['closest_distance'] <- new_col
new_col = c()
for (i in 1 : nrow(df)){
new_col = c(new_col, closest(as.numeric(df[['latitude']][i]), as.numeric(df[['longitude']][i])))
}
df['closest_station'] <- new_col
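The row-by-row loops above work, but they are slow: each call rescans every station, and the result vectors grow inside the loop. A vectorized sketch with `outer()`, using toy data standing in for `df` and `station_df` (the distance helper here mirrors `measure_distance` but is named differently to mark it as illustrative):

```r
# Vectorized nearest-station lookup: compute the full house-by-station
# distance matrix at once, then take row-wise minima.
sphere_dist <- function(phi1, l1, phi2, l2) {
  to_rad <- function(x) x * pi / 180
  theta <- sin(to_rad(phi1)) * sin(to_rad(phi2)) +
           cos(to_rad(phi1)) * cos(to_rad(phi2)) * cos(to_rad(l1 - l2))
  acos(pmin(theta, 1)) * 6378000
}
# toy data standing in for df and station_df
houses   <- data.frame(latitude = c(45.78, 45.76), longitude = c(4.88, 4.84))
stations <- data.frame(stations = c("Cusset", "Foch"),
                       lat  = c(45.7656434, 45.7687864),
                       long = c(4.9008478, 4.8443036))
# distance matrix: rows = houses, columns = stations
d <- outer(seq_len(nrow(houses)), seq_len(nrow(stations)),
           function(i, j) sphere_dist(houses$latitude[i], houses$longitude[i],
                                      stations$lat[j],  stations$long[j]))
houses$closest_distance <- apply(d, 1, min)
houses$closest_station  <- stations$stations[apply(d, 1, which.min)]
```

On the full dataset this replaces roughly 40,000 × number-of-stations scalar calls with a handful of vectorized operations.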
Let’s draw a chart:
hist(df$closest_distance, main='Distance to Nearest Station', xlab = 'distance', col = 'orchid')
Now we draw the correlation matrix again.
num_cols =unlist(lapply(df, is.numeric))
cor_df = df[,num_cols]
cor_df = cor_df[,-which(names(cor_df) %in% c("latitude","longitude"))]
res <-cor(cor_df, use = "pairwise.complete.obs")
corrplot(res, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
As we can see, our new column closest_distance is not correlated with any of the other columns.
# Part 2: Inference & Visualization
Earlier we saw a relation between the number of rooms and area. Here I encode room count as dummy variables and check whether the regression improves. To do this we create a new data frame; since the room count column ranges from 1 to 6, we add 5 indicator columns for rooms 1 to 5, with 6 (or more) rooms as the baseline.
We’ll do the same for district: if columns dist1 to dist9 are all zero, then the district is “Villeurbanne”.
y = df$rooms_count
x = df$surface_housing
# Change point shape (pch = 19) and remove frame.
plot(x, y, main = "room count vs area",
xlab = "area", ylab = "number of rooms",
pch = 19, frame = FALSE)
abline(lm(y ~ x, data = df), col = "purple")
lines(lowess(x, y), col = "blue")
scatterplot(rooms_count ~ surface_housing, data = df, col = 'purple')
#import data
tosave = data.frame(room1 = c(0),room2 = c(0),room3 = c(0), room4= c(0),room5 = c(0),dist1 = c(0),dist2= c(0),dist3 = c(0),dist4 = c(0),dist5 = c(0),dist6= c(0),dist7= c(0),dist8 = c(0),dist9 = c(0),area= c(0))
room_vec <- function(n) {
if (n > 5){
return(c(rep(0,5)))
}
vec = c(rep(0,5))
vec[n] <- 1
return (vec)
}
which_dist <- function(dist) {
vec = c(rep(0,9))
if (dist == "Villeurbanne"){
return(vec)
}
if (dist == "Lyon 1er Arrondissement"){
vec[1] <- 1
}
if (dist == "Lyon 2e Arrondissement"){
vec[2] <- 1
}
if (dist == "Lyon 3e Arrondissement"){
vec[3] <- 1
}
if (dist == "Lyon 4e Arrondissement"){
vec[4] <- 1
}
if (dist == "Lyon 5e Arrondissement"){
vec[5] <- 1
}
if (dist == "Lyon 6e Arrondissement"){
vec[6] <- 1
}
if (dist == "Lyon 7e Arrondissement"){
vec[7] <- 1
}
if (dist == "Lyon 8e Arrondissement"){
vec[8] <- 1
}
if (dist == "Lyon 9e Arrondissement"){
vec[9] <- 1
}
return (vec)
}
####### note: this loop takes about 30 minutes to complete!
#
# for (row in rownames(df)) {
# print(row)
# tosave[nrow(tosave) + 1,] = c(room_vec(as.numeric(df[row,4])),which_dist(as.character(df[row,10])),as.numeric(df[row,5]))
# }
#
# ####### removing the first row which was 0 to initialize the data frame
# tosave <- tosave[-1,]
# #
#
# #### because the loop takes too long to complete I ran it once and saved the result to read from it later.
# write.csv(tosave,"roomdummy.csv", row.names = FALSE)
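As an aside, R can build these indicator columns automatically: when a predictor is a factor, `model.matrix()` and `lm()` perform the dummy coding internally, which avoids the slow manual loop entirely. A sketch on toy rows standing in for `df`:

```r
# model.matrix() expands factors into 0/1 indicator columns automatically
toy <- data.frame(
  rooms_count     = c(1, 2, 6, 3),
  district        = c("Villeurbanne", "Lyon 1er Arrondissement",
                      "Lyon 2e Arrondissement", "Villeurbanne"),
  surface_housing = c(28, 42, 150, 67)
)
mm <- model.matrix(~ factor(rooms_count) + factor(district), data = toy)
dim(mm)   # intercept + 3 room dummies + 2 district dummies = 6 columns
# lm() applies the same coding internally:
fit <- lm(surface_housing ~ factor(rooms_count), data = toy)
```

One caveat: the baseline is the first factor level (alphabetical by default), whereas the manual coding above uses 6+ rooms and Villeurbanne as baselines; `relevel()` can change the reference level if needed.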
dummydf <- read.csv(file = 'roomdummy.csv')
dummydf['total_price'] = df['price']
dummydf['price_meter'] = df['price_per_meter']
head(dummydf)
| room1 | room2 | room3 | room4 | room5 | dist1 | dist2 | dist3 | dist4 | dist5 | dist6 | dist7 | dist8 | dist9 | area | total_price | price_meter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 530000 | 5300.000 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 52 | 328550 | 6318.269 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28 | 42500 | 1517.857 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 67 | 180900 | 2700.000 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28 | 97000 | 3464.286 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42 | 112000 | 2666.667 |
tail(dummydf)
| room1 | room2 | room3 | room4 | room5 | dist1 | dist2 | dist3 | dist4 | dist5 | dist6 | dist7 | dist8 | dist9 | area | total_price | price_meter | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40368 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 25 | 159000 | 6360.000 |
| 40369 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 34 | 226000 | 6647.059 |
| 40370 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 33 | 217300 | 6584.848 |
| 40371 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 23 | 145000 | 6304.348 |
| 40372 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 34 | 206000 | 6058.824 |
| 40373 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 30 | 187500 | 6250.000 |
Now we run the regression for both the dummy-variable data and the original data.
reg2 = lm(surface_housing~rooms_count , df)
reg = lm(area~room1+room2+room3+room4+room5 , dummydf)
print("regression for original data:")
## [1] "regression for original data:"
summary(reg2)
##
## Call:
## lm(formula = surface_housing ~ rooms_count, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.446 -8.397 -2.300 5.652 230.603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.25111 0.19661 47.05 <2e-16 ***
## rooms_count 20.04868 0.06487 309.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 40371 degrees of freedom
## Multiple R-squared: 0.7029, Adjusted R-squared: 0.7029
## F-statistic: 9.553e+04 on 1 and 40371 DF, p-value: < 2.2e-16
print("regression for dummy variable version:")
## [1] "regression for dummy variable version:"
summary(reg)
##
## Call:
## lm(formula = area ~ room1 + room2 + room3 + room4 + room5, data = dummydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.708 -8.290 -1.912 5.637 231.088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 146.7076 0.6300 232.86 <2e-16 ***
## room1 -114.4171 0.6596 -173.45 <2e-16 ***
## room2 -98.3446 0.6465 -152.13 <2e-16 ***
## room3 -77.7952 0.6443 -120.74 <2e-16 ***
## room4 -59.5074 0.6531 -91.12 <2e-16 ***
## room5 -34.6647 0.6986 -49.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.15 on 40367 degrees of freedom
## Multiple R-squared: 0.712, Adjusted R-squared: 0.712
## F-statistic: 1.996e+04 on 5 and 40367 DF, p-value: < 2.2e-16
The \(R^2\) is large and the p-values are tiny, and neither differs much between the two versions. That is likely because room count is genuinely numeric: as it increases, we expect area to increase too. We can therefore conclude that room count and area are strongly related.
In general, finding relations between columns is important. One reason is that when we visualize the data, each variable adds a dimension, and we can’t plot more than 3 dimensions at once. Finding a relation between two columns and then removing one of them, or replacing both with a new value that represents them (using PCA or t-SNE, for example), reduces the dimensionality and makes visualization easier.
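The dimensionality-reduction idea mentioned here can be sketched with base R’s `prcomp()`; the numbers below are toy values standing in for the real columns:

```r
# PCA on three correlated columns; the first component captures most variance
toy <- data.frame(surface = c(28, 42, 67, 84, 100, 52),
                  rooms   = c(1, 2, 3, 4, 5, 2),
                  price   = c(42500, 112000, 180900, 152000, 530000, 328550))
pca <- prcomp(toy, scale. = TRUE)   # scale: the columns have different units
summary(pca)                        # proportion of variance per component
head(pca$x[, 1:2])                  # a 2-D representation usable for plotting
```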
Let’s draw the price map again, but this time using price per meter so that the size of the house doesn’t affect our analysis.
First we draw a map to make the distinction between districts visible.
ggplot() +geom_point( data=df, aes(y=latitude, x=longitude, color = as.factor(district)), size = 0.4)
Now draw price gradient for both total price and price per meter
print("price per meter")
## [1] "price per meter"
ggmap(map)+
geom_point(
data = df,
mapping = aes(x = longitude, y = latitude, color = price_per_meter, shape = type_property)
, size = 1.5
)+ scale_color_gradientn(colours=c("black","red"))+scale_shape_manual(values = c(4,15))
print("price")
## [1] "price"
ggmap(map)+
geom_point(
data = df,
mapping = aes(x = longitude, y = latitude, color = price, shape = type_property)
, size = 1.5
)+ scale_color_gradientn(colours=c("black","red"))+scale_shape_manual(values = c(4,15))
From the map in the first part, it seemed that prices were higher in some areas and lower in others. Comparing the two maps now, however, the difference might be because houses are larger in some districts rather than more expensive. To check this, we fit a regression model.
model <- lm(price_per_meter ~ district, data = df)
summary(model)
##
## Call:
## lm(formula = price_per_meter ~ district, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4106.2 -847.0 -98.4 759.6 31387.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4653.86 29.46 157.992 <2e-16 ***
## districtLyon 2e Arrondissement 369.98 44.19 8.373 <2e-16 ***
## districtLyon 3e Arrondissement -489.72 33.65 -14.554 <2e-16 ***
## districtLyon 4e Arrondissement -16.53 39.52 -0.418 0.676
## districtLyon 5e Arrondissement -1073.39 37.49 -28.628 <2e-16 ***
## districtLyon 6e Arrondissement 452.34 38.88 11.635 <2e-16 ***
## districtLyon 7e Arrondissement -394.44 34.17 -11.544 <2e-16 ***
## districtLyon 8e Arrondissement -874.57 34.79 -25.136 <2e-16 ***
## districtLyon 9e Arrondissement -1185.26 36.67 -32.323 <2e-16 ***
## districtVilleurbanne -1290.89 31.95 -40.401 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1256 on 40363 degrees of freedom
## Multiple R-squared: 0.1646, Adjusted R-squared: 0.1644
## F-statistic: 883.4 on 9 and 40363 DF, p-value: < 2.2e-16
The \(R^2\) is very low, close to 0. Taken at face value, this suggests that district explains little of the variation in price per meter, and that, contrary to what we might expect, prices in different districts don’t differ much.
Here is a bar chart of each district’s mean price per meter:
dists = unique(df[c('district')])
mean_dist_price = c()
for (i in (1:nrow(dists))){
word = as.character(dists[i,1])
filtered <- df[as.character(df$district) == dists[i,1],]
mean_dist_price <- c(mean_dist_price, mean(filtered$price_per_meter))
}
par(mar=c(11,4,4,4))
barplot(mean_dist_price,
main = "district mean prices",
ylab = "mean price",
names.arg = dists$district, cex.names = 1,las =2,
col = "hotpink")
However, the standard way to regress price on district is to encode the districts as dummy variables. Let’s do this:
reg = lm(price_meter~dist1+dist2+dist3+dist4+dist5+dist6+dist7+dist8+dist9 , dummydf)
print("regression for dummy variable version of price/meter vs district:")
## [1] "regression for dummy variable version of price/meter vs district:"
summary(reg)
##
## Call:
## lm(formula = price_meter ~ dist1 + dist2 + dist3 + dist4 + dist5 +
## dist6 + dist7 + dist8 + dist9, data = dummydf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4106.2 -847.0 -98.4 759.6 31387.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3362.98 12.38 271.674 < 2e-16 ***
## dist1 1290.89 31.95 40.401 < 2e-16 ***
## dist2 1660.87 35.18 47.204 < 2e-16 ***
## dist3 801.17 20.44 39.199 < 2e-16 ***
## dist4 1274.35 29.11 43.772 < 2e-16 ***
## dist5 217.50 26.29 8.272 < 2e-16 ***
## dist6 1743.23 28.23 61.742 < 2e-16 ***
## dist7 896.45 21.29 42.116 < 2e-16 ***
## dist8 416.31 22.28 18.689 < 2e-16 ***
## dist9 105.62 25.10 4.208 2.59e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1256 on 40363 degrees of freedom
## Multiple R-squared: 0.1646, Adjusted R-squared: 0.1644
## F-statistic: 883.4 on 9 and 40363 DF, p-value: < 2.2e-16
The R-squared is very low: district accounts for only about 16% of the variation in price per meter. Interestingly, dist6 has the highest coefficient, and we saw earlier that district 6 had the highest mean; likewise dist9 has the smallest coefficient, just as it has the shortest bar in the chart.
But here we want to compare districts; the goal isn’t to predict price or model it as a function of other variables. Price may not depend entirely on district, yet there can still be meaningful differences between district averages. Besides, modeling price as a function of district alone yields a regression built purely from dummy variables, which isn’t the best way to answer our question. For a more precise analysis, we therefore compare districts pairwise with t-tests to find out which pairs differ meaningfully. Looking at the chart, districts 2 and 6, for example, have approximately the same average, while districts 5 and 6 look significantly different.
In the t-test, the null hypothesis is that the difference in means is 0 (the two groups have the same average). The alternative is that the means differ. With a threshold of 0.05, we fail to reject the null hypothesis when the p-value > 0.05, meaning that at the 95% confidence level there is no meaningful difference between the prices of those two districts.
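As an aside, the pairwise grid below can also be produced with R’s built-in `pairwise.t.test()`, which additionally adjusts the p-values for multiple testing (something the manual loop does not do). A sketch on toy data standing in for price_per_meter and district:

```r
# Pairwise Welch-style t-tests with Bonferroni correction (toy data)
set.seed(1)
toy_price    <- c(rnorm(50, mean = 3300, sd = 300),   # "Villeurbanne"-like
                  rnorm(50, mean = 5100, sd = 300),   # "Lyon 6e"-like
                  rnorm(50, mean = 3350, sd = 300))   # "Lyon 9e"-like
toy_district <- rep(c("Villeurbanne", "Lyon 6e", "Lyon 9e"), each = 50)
# pool.sd = FALSE makes each comparison an unpooled test, like t.test() below
res <- pairwise.t.test(toy_price, toy_district,
                       p.adjust.method = "bonferroni", pool.sd = FALSE)
res$p.value   # matrix of adjusted p-values, one per district pair
```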
#initialize frame
word = as.character(dists[1,1])
major = df[df$"district" == word, c("price_per_meter")]
minor = df[df$"district" == as.character(dists[1,1]), c("price_per_meter")]
res = t.test(major,minor)
temp = c(word,as.character(dists[1,1]),tidy(res))
tdf = data.frame(temp)
tdf = tdf[-1,]
#iterate through districts and compare each with other ones.
for (i in (1:nrow(dists))){
if (i == 10){
break
}
word = as.character(dists[i,1])
major = df[df$"district" == word, c("price_per_meter")]
for (j in ((i+1):nrow(dists))){
minor = df[df$"district" == as.character(dists[j,1]), c("price_per_meter")]
res = t.test(major,minor)
temp = c(word,as.character(dists[j,1]),tidy(res))
tdf[nrow(tdf)+1,] = temp
}
}
head(tdf, n = nrow(tdf))
| X.Villeurbanne. | X.Villeurbanne..1 | estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Villeurbanne | Lyon 1er Arrondissement | -1290.88508 | 3362.978 | 4653.863 | -35.3275890 | 0.0000000 | 2230.533 | -1362.54187 | -1219.22828 | Welch Two Sample t-test | two.sided |
| Villeurbanne | Lyon 2e Arrondissement | -1660.86812 | 3362.978 | 5023.846 | -41.0054642 | 0.0000000 | 1715.569 | -1740.30972 | -1581.42651 | Welch Two Sample t-test | two.sided |
| Villeurbanne | Lyon 3e Arrondissement | -801.16910 | 3362.978 | 4164.147 | -41.7518328 | 0.0000000 | 12218.838 | -838.78226 | -763.55595 | Welch Two Sample t-test | two.sided |
| Villeurbanne | Lyon 4e Arrondissement | -1274.35187 | 3362.978 | 4637.330 | -40.1260694 | 0.0000000 | 2986.459 | -1336.62302 | -1212.08072 | Welch Two Sample t-test | two.sided |
| Villeurbanne | Lyon 5e Arrondissement | -217.49561 | 3362.978 | 3580.473 | -7.7286288 | 0.0000000 | 4163.302 | -272.66808 | -162.32314 | Welch Two Sample t-test | two.sided |
| Villeurbanne | Lyon 6e Arrondissement | -1743.22973 | 3362.978 | 5106.208 | -59.7422081 | 0.0000000 | 3395.631 | -1800.44030 | -1686.01916 | Welch Two Sample t-test | two.sided |
| Villeurbanne | Lyon 7e Arrondissement | -896.44591 | 3362.978 | 4259.424 | -44.1516309 | 0.0000000 | 10180.982 | -936.24536 | -856.64646 | Welch Two Sample t-test | two.sided |
| Villeurbanne | Lyon 8e Arrondissement | -416.31272 | 3362.978 | 3779.291 | -18.8128277 | 0.0000000 | 8094.375 | -459.69163 | -372.93381 | Welch Two Sample t-test | two.sided |
| Villeurbanne | Lyon 9e Arrondissement | -105.62467 | 3362.978 | 3468.603 | -4.6313766 | 0.0000037 | 5703.519 | -150.33373 | -60.91561 | Welch Two Sample t-test | two.sided |
| Lyon 1er Arrondissement | Lyon 2e Arrondissement | -369.98304 | 4653.863 | 5023.846 | -7.1023710 | 0.0000000 | 3114.119 | -472.12293 | -267.84315 | Welch Two Sample t-test | two.sided |
| Lyon 1er Arrondissement | Lyon 3e Arrondissement | 489.71597 | 4653.863 | 4164.147 | 12.8991454 | 0.0000000 | 2574.249 | 415.27095 | 564.16099 | Welch Two Sample t-test | two.sided |
| Lyon 1er Arrondissement | Lyon 4e Arrondissement | 16.53321 | 4653.863 | 4637.330 | 0.3623626 | 0.7171011 | 3812.961 | -72.92077 | 105.98719 | Welch Two Sample t-test | two.sided |
| Lyon 1er Arrondissement | Lyon 5e Arrondissement | 1073.38947 | 4653.863 | 3580.473 | 24.8547293 | 0.0000000 | 3674.873 | 988.71754 | 1158.06139 | Welch Two Sample t-test | two.sided |
| Lyon 1er Arrondissement | Lyon 6e Arrondissement | -452.34466 | 4653.863 | 5106.208 | -10.3110838 | 0.0000000 | 3669.387 | -538.35616 | -366.33316 | Welch Two Sample t-test | two.sided |
| Lyon 1er Arrondissement | Lyon 7e Arrondissement | 394.43917 | 4653.863 | 4259.424 | 10.2343952 | 0.0000000 | 2715.109 | 318.86740 | 470.01093 | Welch Two Sample t-test | two.sided |
| Lyon 1er Arrondissement | Lyon 8e Arrondissement | 874.57236 | 4653.863 | 3779.291 | 22.1227853 | 0.0000000 | 2958.569 | 797.05807 | 952.08664 | Welch Two Sample t-test | two.sided |
| Lyon 1er Arrondissement | Lyon 9e Arrondissement | 1185.26041 | 4653.863 | 3468.603 | 29.6941417 | 0.0000000 | 3009.970 | 1106.99574 | 1263.52508 | Welch Two Sample t-test | two.sided |
| Lyon 2e Arrondissement | Lyon 3e Arrondissement | 859.69901 | 5023.846 | 4164.147 | 20.5702694 | 0.0000000 | 1935.160 | 777.73442 | 941.66360 | Welch Two Sample t-test | two.sided |
| Lyon 2e Arrondissement | Lyon 4e Arrondissement | 386.51625 | 5023.846 | 4637.330 | 7.9110494 | 0.0000000 | 2989.438 | 290.71799 | 482.31451 | Welch Two Sample t-test | two.sided |
| Lyon 2e Arrondissement | Lyon 5e Arrondissement | 1443.37251 | 5023.846 | 3580.473 | 30.9818374 | 0.0000000 | 2744.996 | 1352.02201 | 1534.72300 | Welch Two Sample t-test | two.sided |
| Lyon 2e Arrondissement | Lyon 6e Arrondissement | -82.36162 | 5023.846 | 5106.208 | -1.7441464 | 0.0812435 | 2795.149 | -174.95461 | 10.23138 | Welch Two Sample t-test | two.sided |
| Lyon 2e Arrondissement | Lyon 7e Arrondissement | 764.42220 | 5023.846 | 4259.424 | 18.0642751 | 0.0000000 | 2026.764 | 681.43325 | 847.41116 | Welch Two Sample t-test | two.sided |
| Lyon 2e Arrondissement | Lyon 8e Arrondissement | 1244.55540 | 5023.846 | 3779.291 | 28.7942832 | 0.0000000 | 2188.004 | 1159.79434 | 1329.31645 | Welch Two Sample t-test | two.sided |
| Lyon 2e Arrondissement | Lyon 9e Arrondissement | 1555.24345 | 5023.846 | 3468.603 | 35.6929305 | 0.0000000 | 2234.908 | 1469.79592 | 1640.69098 | Welch Two Sample t-test | two.sided |
| Lyon 3e Arrondissement | Lyon 4e Arrondissement | -473.18277 | 4164.147 | 4637.330 | -14.1722739 | 0.0000000 | 3566.799 | -538.64410 | -407.72143 | Welch Two Sample t-test | two.sided |
| Lyon 3e Arrondissement | Lyon 5e Arrondissement | 583.67349 | 4164.147 | 3580.473 | 19.4765139 | 0.0000000 | 5091.694 | 524.92319 | 642.42379 | Welch Two Sample t-test | two.sided |
| Lyon 3e Arrondissement | Lyon 6e Arrondissement | -942.06063 | 4164.147 | 5106.208 | -30.4435296 | 0.0000000 | 4144.323 | -1002.72851 | -881.39275 | Welch Two Sample t-test | two.sided |
| Lyon 3e Arrondissement | Lyon 7e Arrondissement | -95.27681 | 4164.147 | 4259.424 | -4.1846779 | 0.0000288 | 10988.826 | -139.90621 | -50.64740 | Welch Two Sample t-test | two.sided |
| Lyon 3e Arrondissement | Lyon 8e Arrondissement | 384.85638 | 4164.147 | 3779.291 | 15.7664562 | 0.0000000 | 9495.696 | 337.00791 | 432.70485 | Welch Two Sample t-test | two.sided |
| Lyon 3e Arrondissement | Lyon 9e Arrondissement | 695.54444 | 4164.147 | 3468.603 | 27.7936313 | 0.0000000 | 7105.987 | 646.48736 | 744.60151 | Welch Two Sample t-test | two.sided |
| Lyon 4e Arrondissement | Lyon 5e Arrondissement | 1056.85626 | 4637.330 | 3580.473 | 26.9444149 | 0.0000000 | 4850.805 | 979.96028 | 1133.75224 | Welch Two Sample t-test | two.sided |
| Lyon 4e Arrondissement | Lyon 6e Arrondissement | -468.87786 | 4637.330 | 5106.208 | -11.7293992 | 0.0000000 | 4636.048 | -547.24707 | -390.50866 | Welch Two Sample t-test | two.sided |
| Lyon 4e Arrondissement | Lyon 7e Arrondissement | 377.90596 | 4637.330 | 4259.424 | 11.1015109 | 0.0000000 | 3793.211 | 311.16563 | 444.64629 | Welch Two Sample t-test | two.sided |
| Lyon 4e Arrondissement | Lyon 8e Arrondissement | 858.03915 | 4637.330 | 3779.291 | 24.4036714 | 0.0000000 | 4165.080 | 789.10630 | 926.97200 | Welch Two Sample t-test | two.sided |
| Lyon 4e Arrondissement | Lyon 9e Arrondissement | 1168.72720 | 4637.330 | 3468.603 | 32.8383540 | 0.0000000 | 4170.256 | 1098.95122 | 1238.50318 | Welch Two Sample t-test | two.sided |
| Lyon 5e Arrondissement | Lyon 6e Arrondissement | -1525.73412 | 3580.473 | 5106.208 | -41.0520615 | 0.0000000 | 5288.241 | -1598.59450 | -1452.87375 | Welch Two Sample t-test | two.sided |
| Lyon 5e Arrondissement | Lyon 7e Arrondissement | -678.95030 | 3580.473 | 4259.424 | -22.1200112 | 0.0000000 | 5412.017 | -739.12278 | -618.77783 | Welch Two Sample t-test | two.sided |
| Lyon 5e Arrondissement | Lyon 8e Arrondissement | -198.81711 | 3580.473 | 3779.291 | -6.2265029 | 0.0000000 | 5875.164 | -261.41319 | -136.22103 | Welch Two Sample t-test | two.sided |
| Lyon 5e Arrondissement | Lyon 9e Arrondissement | 111.87094 | 3580.473 | 3468.603 | 3.4524090 | 0.0005597 | 5661.324 | 48.34721 | 175.39468 | Welch Two Sample t-test | two.sided |
| Lyon 6e Arrondissement | Lyon 7e Arrondissement | 846.78382 | 5106.208 | 4259.424 | 26.7563043 | 0.0000000 | 4422.370 | 784.73788 | 908.82977 | Welch Two Sample t-test | two.sided |
| Lyon 6e Arrondissement | Lyon 8e Arrondissement | 1326.91701 | 5106.208 | 3779.291 | 40.3945026 | 0.0000000 | 4855.943 | 1262.51820 | 1391.31583 | Welch Two Sample t-test | two.sided |
| Lyon 6e Arrondissement | Lyon 9e Arrondissement | 1637.60507 | 5106.208 | 3468.603 | 49.1642374 | 0.0000000 | 4778.781 | 1572.30435 | 1702.90579 | Welch Two Sample t-test | two.sided |
| Lyon 7e Arrondissement | Lyon 8e Arrondissement | 480.13319 | 4259.424 | 3779.291 | 18.9808252 | 0.0000000 | 9525.229 | 430.54824 | 529.71815 | Welch Two Sample t-test | two.sided |
| Lyon 7e Arrondissement | Lyon 9e Arrondissement | 790.82124 | 4259.424 | 3468.603 | 30.5452669 | 0.0000000 | 7390.948 | 740.06919 | 841.57330 | Welch Two Sample t-test | two.sided |
| Lyon 8e Arrondissement | Lyon 9e Arrondissement | 310.68805 | 3779.291 | 3468.603 | 11.3617314 | 0.0000000 | 7584.151 | 257.08402 | 364.29208 | Welch Two Sample t-test | two.sided |
The district pairs with no statistically significant difference in mean price (p-value > 0.05) are:
tdf[which(tdf$'p.value'>0.05),]
## X.Villeurbanne. X.Villeurbanne..1 estimate estimate1 estimate2
## 12 Lyon 1er Arrondissement Lyon 4e Arrondissement 16.53321 4653.863 4637.330
## 21 Lyon 2e Arrondissement Lyon 6e Arrondissement -82.36162 5023.846 5106.208
## statistic p.value parameter conf.low conf.high method
## 12 0.3623626 0.7171011 3812.961 -72.92077 105.98719 Welch Two Sample t-test
## 21 -1.7441464 0.0812435 2795.149 -174.95461 10.23138 Welch Two Sample t-test
## alternative
## 12 two.sided
## 21 two.sided
This result indicates that all districts have different mean prices except for districts 1 and 4, and districts 2 and 6, whose averages are statistically indistinguishable.
Looking at the histogram, this result might seem odd, since the bars for districts 5 and 9 appear to have the same height, yet the t-test says otherwise. The bars are not adjacent, and comparing them by eye is error-prone. We can also look at the averages:
print(mean_dist_price)
## [1] 3362.978 4653.863 5023.846 4164.147 4637.330 3580.473 5106.208 4259.424
## [9] 3779.291 3468.603
The means of districts 1 and 4 are 4653 and 4637 respectively, a difference of only 16 (about 0.3%). Likewise, the means of districts 2 and 6 are 5023 and 5106, a difference of 83 (about 1.6%). These are exactly the two pairs the t-test declared equal. Districts 5 and 9, however, have means of 3580 and 3468, a difference of 112 (about 3%), which is large enough for the t-test to conclude that their prices differ.
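To see how the Welch t-test behaves in such cases, here is a minimal sketch on simulated prices (the means and spreads are made up for illustration, not taken from the data):

```r
set.seed(1)
# Simulated price/m^2 for three hypothetical districts
d1 <- rnorm(500, mean = 4650, sd = 800)
d4 <- rnorm(500, mean = 4640, sd = 800)  # almost the same true mean as d1
d9 <- rnorm(500, mean = 3470, sd = 800)  # clearly cheaper

# t.test() runs a Welch two-sample t-test by default (var.equal = FALSE)
p_close <- t.test(d1, d4)$p.value  # typically large: fail to reject equal means
p_far   <- t.test(d1, d9)$p.value  # tiny: reject equal means
```

A tiny true difference simply drowns in the sampling noise, which is exactly why pairs like 1 and 4 come out as indistinguishable above.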
In this part we check whether the prices in our data set come from a normal distribution.
Note that we will use price-per-meter values here, so that the size of the house does not interfere with the analysis.
We start by displaying a random sample of 10 rows using the function sample_n() from the dplyr package.
There are several normality tests, such as the Kolmogorov-Smirnov (K-S) test and Shapiro-Wilk's test.
The null hypothesis of these tests is that the sample comes from a normal distribution. If the test is significant, the distribution is non-normal.
Shapiro-Wilk's method is widely recommended for normality testing and has better power than K-S. It is based on the correlation between the data and the corresponding normal scores.
The R function shapiro.test() performs the Shapiro-Wilk test of normality for a single (univariate) variable:
set.seed(13)
#Show 10 random rows:
dplyr::sample_n(df, 10)
## date_transaction type_purchase type_property rooms_count surface_housing
## 1 0017-12-11 ancien appartement 2 46
## 2 0020-04-09 ancien appartement 2 62
## 3 0019-05-13 ancien appartement 4 67
## 4 0017-03-10 ancien appartement 4 64
## 5 0017-05-10 VEFA appartement 2 34
## 6 0020-03-12 VEFA appartement 4 79
## 7 0021-04-30 ancien appartement 2 53
## 8 0018-08-31 ancien appartement 5 110
## 9 0019-05-31 ancien appartement 5 115
## 10 0018-12-28 ancien appartement 1 20
## surface_terrain parkings_count price address
## 1 0 1 170000 9 RUE PROFESSEUR PAUL SISLEY
## 2 0 1 218000 63 RUE PAUL VERLAINE
## 3 0 0 164000 173 RUE ANATOLE FRANCE
## 4 0 0 150000 12 AV CDT LHERMINIER
## 5 0 1 153100 16 RUE DES DOCTEURS CORDIER
## 6 0 1 424860 140 RUE COMMANDANT CHARCOT
## 7 0 0 165000 3 RUE DU MARCHE
## 8 0 1 361700 3 RUE ST ROMAIN
## 9 0 1 215000 82 RUE JEAN SARRAZIN
## 10 0 0 74370 131 AV JEAN JAURES
## district latitude longitude date_construction price_per_meter
## 1 Lyon 3e Arrondissement 45.75182 4.869549 0096-08-29 3695.652
## 2 Villeurbanne 45.76419 4.880895 0003-06-11 3516.129
## 3 Villeurbanne 45.77038 4.886646 0003-06-11 2447.761
## 4 Villeurbanne 45.77009 4.883356 0003-06-11 2343.750
## 5 Lyon 9e Arrondissement 45.80075 4.832650 0090-01-01 4502.941
## 6 Lyon 5e Arrondissement 45.74706 4.788048 0090-01-01 5377.975
## 7 Lyon 9e Arrondissement 45.77423 4.807017 0090-01-01 3113.208
## 8 Lyon 8e Arrondissement 45.74261 4.861444 0090-01-01 3288.182
## 9 Lyon 8e Arrondissement 45.73115 4.863194 0090-01-01 1869.565
## 10 Lyon 7e Arrondissement 45.74133 4.839913 0093-04-03 3718.500
## closest_distance closest_station
## 1 99.26294 Dauphiné - Lacassagne
## 2 552.59206 Gratte-Ciel
## 3 358.64530 Flachet - Alain Gilles
## 4 141.03288 Gratte-Ciel
## 5 1717.61237 Cuire
## 6 2511.09511 Hôtel de région - Montrochet
## 7 132.29564 Valmy
## 8 326.75460 Jet d'Eau - Mendès France
## 9 316.86765 Petite Guille
## 10 467.16217 Place Jean Jaurès
First, we remove the outliers:
outliers <- boxplot(df$price_per_meter, plot=FALSE)$out
x<- df[-which(df$price_per_meter %in% outliers),]
plot(density(df$price_per_meter),main = "Density plot of price/meter (Original)",xlab = "Price/Meter")
plot(density(x$price_per_meter),main = "Density plot of price/meter (after removing outliers)",xlab = "Price/Meter")
#ggqqplot(df$price)
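For reference, boxplot(..., plot = FALSE)$out returns the points lying beyond the whiskers, which is essentially Tukey's 1.5×IQR rule (boxplot works from hinges, which can differ slightly from quantile()); a quick check on toy values:

```r
v <- c(rep(10, 20), 100)             # one obvious outlier
out <- boxplot(v, plot = FALSE)$out  # points flagged by boxplot

# Manual 1.5*IQR rule for comparison
q <- quantile(v, c(0.25, 0.75))
iqr <- q[2] - q[1]
manual <- v[v < q[1] - 1.5 * iqr | v > q[2] + 1.5 * iqr]
# Here both rules flag exactly the value 100
```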
We also draw a histogram and fit a normal curve on it:
g = df$price_per_meter
h <- hist(g, breaks = 100,
col = "mediumpurple3", xlab = "Price/meter", main = "Overall")
xfit <- seq(min(g), max(g), length = 400)
yfit <- dnorm(xfit, mean = mean(g), sd = sd(g))
yfit <- yfit * diff(h$mids[1:2]) * length(g)
lines(xfit, yfit, col = "red", lwd = 2)
Now we do the normality test:
set.seed(2)
#Draw a random sample of 100 rows (shapiro.test accepts 3 to 5000 values):
sample <- dplyr::sample_n(x, 100)
shapiro.test(sample$price_per_meter)
##
## Shapiro-Wilk normality test
##
## data: sample$price_per_meter
## W = 0.97677, p-value = 0.07433
With a threshold of 0.05 for the p-value: if \(p_{value} > 0.05\), we fail to reject the null hypothesis at the 5% significance level. Since our p-value (0.074) is above \(0.05\), we fail to reject normality of the price-per-meter distribution: the sample is consistent with a normal distribution. (Failing to reject does not prove normality; it only means the data shows no significant departure from it.)
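As a sanity check on the test itself, shapiro.test() applied to genuinely Gaussian data should usually produce a large p-value, while clearly skewed data gets rejected; a small sketch with simulated samples:

```r
set.seed(42)
gaussian_sample <- rnorm(100)  # truly normal
skewed_sample   <- rexp(100)   # strongly right-skewed

p_gauss <- shapiro.test(gaussian_sample)$p.value  # usually well above 0.05
p_skew  <- shapiro.test(skewed_sample)$p.value    # far below 0.05
```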
In part 1, where we drew the correlation matrix, we saw that price is mostly driven by surface_housing; the fact that price is (approximately) normally distributed made me wonder whether area has a normal distribution too. So we repeat everything for area.
First, we remove the outliers:
outliers <- boxplot(df$surface_housing, plot=FALSE)$out
x<- df[-which(df$surface_housing%in% outliers),]
plot(density(df$surface_housing),main = "Density plot of area (Original)",xlab = "Area (m^2)")
plot(density(x$surface_housing),main = "Density plot of area (after removing outliers)",xlab = "Area (m^2)")
#ggqqplot(df$price)
set.seed(2)
#Draw a random sample of 100 rows (shapiro.test accepts 3 to 5000 values):
sample <- dplyr::sample_n(x, 100)
shapiro.test(sample$surface_housing)
##
## Shapiro-Wilk normality test
##
## data: sample$surface_housing
## W = 0.98349, p-value = 0.2461
Even though it may not look like it, since \(p_{value} > 0.05\) we fail to reject that the distribution is normal.
Why do parts 3 and 4 give useful results?
Knowing the distribution of a quantity makes it predictable and easier to estimate. For an important variable like price, this has clear benefits; it also helps us spot prices that are unusually high or low.
(For example, if the price is really low you should suspect that the house is haunted, and you don't want to live in it and turn your life into a nightmare.)
We expect larger houses to be more expensive; however, we do not expect the price per meter to change with the area of the house.
model = lm(formula = price ~ surface_housing, df)
summary(model)
##
## Call:
## lm(formula = price ~ surface_housing, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -852973 -57388 -1768 50073 2463692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11056.18 1285.94 -8.598 <2e-16 ***
## surface_housing 4092.05 18.09 226.146 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102600 on 40371 degrees of freedom
## Multiple R-squared: 0.5588, Adjusted R-squared: 0.5588
## F-statistic: 5.114e+04 on 1 and 40371 DF, p-value: < 2.2e-16
plot(df[['surface_housing']], df[['price']], main = paste('Housing Price'), xlab = 'Area', ylab = 'Price', col = 'mediumpurple')
abline(model,lwd= 4, col = 'lightslateblue')
We can use the slope and its standard error to calculate a confidence interval for the true slope (not just the estimate from our sample):
\[I = [\mu - 1.96\sigma , \mu + 1.96\sigma] \] where 1.96 is the 0.975 quantile of the standard normal distribution, which covers 95% of it. \[ \mu = 4092.05\] \[\sigma = 18.09 \] \[ I = [4056, 4127]\] Clearly \(0 \notin I\), so at the 95% confidence level area and price have a linear relationship. (Well, who didn't already know this? :) )
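R computes this interval directly with confint(), which uses the t distribution instead of the 1.96 normal approximation (nearly identical at this sample size); a sketch on simulated data with a known true slope:

```r
set.seed(3)
area  <- runif(200, 20, 150)
price <- 4000 * area + rnorm(200, sd = 50000)  # true slope is 4000
fit <- lm(price ~ area)

ci  <- confint(fit, "area", level = 0.95)      # t-based interval for the slope
est <- coef(summary(fit))["area", "Estimate"]
se  <- coef(summary(fit))["area", "Std. Error"]
manual <- c(est - 1.96 * se, est + 1.96 * se)  # normal approximation
# ci and manual agree to within a small fraction of one standard error
```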
However, doing exactly the same thing for price/meter and area:
model = lm(formula = price_per_meter ~ surface_housing, df)
summary(model)
##
## Call:
## lm(formula = price_per_meter ~ surface_housing, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3111.7 -987.0 -119.4 870.1 30836.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4196.6511 17.1766 244.32 <2e-16 ***
## surface_housing -3.5399 0.2417 -14.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1371 on 40371 degrees of freedom
## Multiple R-squared: 0.005285, Adjusted R-squared: 0.005261
## F-statistic: 214.5 on 1 and 40371 DF, p-value: < 2.2e-16
plot(df[['surface_housing']], df[['price_per_meter']], main = paste('Housing Price'), xlab = 'Area', ylab = 'Price/meter', col = 'mediumpurple')
abline(model,lwd= 4, col = 'lightslateblue')
Looking at \(R^2\) and the plots, we immediately see that area explains about 56% of the variation in price, but has essentially nothing to do with price/meter: the fitted line is horizontal and the R-squared is very close to 0.
One might think the closer a house is to a station, the more expensive it must be, because of the benefit of its location. Let’s check this out.
model = lm(formula = price_per_meter ~ closest_distance, df)
summary(model)
##
## Call:
## lm(formula = price_per_meter ~ closest_distance, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3205.8 -949.7 -126.6 839.5 30844.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4226.76134 10.17923 415.23 <2e-16 ***
## closest_distance -0.54513 0.01593 -34.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1355 on 40371 degrees of freedom
## Multiple R-squared: 0.0282, Adjusted R-squared: 0.02817
## F-statistic: 1171 on 1 and 40371 DF, p-value: < 2.2e-16
plot(df[['closest_distance']], df[['price_per_meter']], main = paste('Housing Price vs distance to station'), xlab = 'distance to closest station', ylab = 'Price', col = 'mediumpurple')
abline(model,lwd= 4, col = 'lightslateblue')
Despite the slightly negative slope, this parameter does not seem to have much effect on price; here too the R-squared is very close to 0.
In the previous part we analyzed the relation between price and several other factors separately. Now we try to model price as a function of more than one parameter. We saw that area explains about 56% of the variation in price, and that district has a slight effect on it.
#drop identifier and derived columns, including price per meter and rooms count
dummydf3 = dummydf[,-c(1,2,3,4,5,17)]
model = lm(formula = price~., dummydf3)
summary(model)
## Warning in summary.lm(model): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = price ~ ., data = dummydf3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.284e-08 -1.500e-12 4.000e-13 2.700e-12 2.286e-10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.084e-11 1.708e-12 2.977e+01 < 2e-16 ***
## dist1 -8.478e-11 2.931e-12 -2.892e+01 < 2e-16 ***
## dist2 -1.239e-10 3.267e-12 -3.792e+01 < 2e-16 ***
## dist3 -6.268e-11 1.884e-12 -3.326e+01 < 2e-16 ***
## dist4 -9.278e-11 2.701e-12 -3.435e+01 < 2e-16 ***
## dist5 -8.858e-12 2.391e-12 -3.704e+00 0.000213 ***
## dist6 -1.455e-10 2.696e-12 -5.396e+01 < 2e-16 ***
## dist7 -5.695e-11 1.956e-12 -2.911e+01 < 2e-16 ***
## dist8 -2.828e-11 2.025e-12 -1.397e+01 < 2e-16 ***
## dist9 -8.159e-13 2.276e-12 -3.580e-01 0.720017
## area -4.669e-12 3.117e-14 -1.498e+02 < 2e-16 ***
## total_price 1.000e+00 5.985e-18 1.671e+17 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.139e-10 on 40361 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.757e+33 on 11 and 40361 DF, p-value: < 2.2e-16
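The "essentially perfect fit" warning printed above is worth pausing on: it appears whenever a copy of the response sneaks in among the predictors, as this toy reproduction (with hypothetical column names) shows:

```r
set.seed(7)
toy <- data.frame(area = runif(100, 20, 150))
toy$price <- 4000 * toy$area + rnorm(100, sd = 40000)
toy$total_price <- toy$price  # the response itself, disguised as a predictor

r2_leak  <- summary(lm(price ~ area + total_price, data = toy))$r.squared
r2_clean <- summary(lm(price ~ area, data = toy))$r.squared
# r2_leak is (numerically) 1 regardless of how informative area is
```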
The R-squared is exactly 1, but the warning tells us why: total_price (which is essentially the response itself) was left among the predictors, so the fit is circular and the perfect R-squared is an artifact rather than evidence that district and area determine the price. To judge their real contribution, total_price should be dropped before refitting.
## 8: Different prices for different years
In previous part we didn’t consider time to have an affect on price. In this part we try to check price/meter mean for each district but this time separately for each year.
#create new df and add required data to it:
td = df[,c(10,14)]
td['year'] <- year(df$date_transaction)
dists = unique(td[c('district')])
years = unique(td[c('year')])
tdy = data.frame(district = c(0), year = c(0) , mean_price_meter = c(0))
for (i in (1:nrow(dists))){
word = as.character(dists[i,1])
filtered <- td[as.character(td$district) == word,]
for (j in (1:nrow(years))){
y = as.numeric(years[j,1])
filtered2 <- filtered[as.numeric(filtered$year) == y,]
mdtp <- mean(filtered2$price_per_meter)
if(!is.na(mdtp)){
tdy[nrow(tdy)+1,] = c(word , y , mdtp)
}
}
}
#remove first row:
tdy = tdy[-1,]
# from 1990 to 2023
newyear = tdy$year
for (j in (1:nrow(years))){
y = as.numeric(years[j,1])
if (-1 < y & y < 24){
newyear[newyear == y] <- 2000+y
}
else{
newyear[newyear == y] <- 1900+y
}
}
tdy['year'] = newyear
#tdy = tdy[order(tdy$year),]
tdy$mean_price_meter =as.numeric(as.character(tdy$mean_price_meter))
tdy$mean_price_meter = round(tdy$mean_price_meter )
tdy$year =as.numeric(as.character(tdy$year))
ggplot(tdy,aes(x = year, y = mean_price_meter, group = district))+geom_line(aes(color = district) )+geom_point(aes(color = district))
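As an aside, the nested loop above can be replaced by a single call to base R's aggregate(), which computes one mean per (district, year) combination; a sketch on toy data mirroring the column names used here:

```r
toy <- data.frame(
  district = c("A", "A", "B", "B", "B"),
  year     = c(2016, 2016, 2016, 2017, 2017),
  price_per_meter = c(3000, 3200, 4000, 4400, 4600)
)
# One row per (district, year) pair that actually occurs in the data
means <- aggregate(price_per_meter ~ district + year, data = toy, FUN = mean)
```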
tdyear <- summarise_at(group_by(td, year), vars(price_per_meter), mean)
ggplot(tdyear , aes(x=year, y=price_per_meter)) + geom_line(color = "purple") +geom_point( color = "mediumorchid", size = 3) + ggtitle("Average housing prices/meter vs. Year")
model = lm(price_per_meter ~ year, tdyear)
summary(model)
##
## Call:
## lm(formula = price_per_meter ~ year, data = tdyear)
##
## Residuals:
## 1 2 3 4 5 6
## 6.564 60.892 -49.080 -93.128 57.105 17.646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -602.76 301.25 -2.001 0.116005
## year 250.59 16.22 15.454 0.000102 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.83 on 4 degrees of freedom
## Multiple R-squared: 0.9835, Adjusted R-squared: 0.9794
## F-statistic: 238.8 on 1 and 4 DF, p-value: 0.0001023
The R-squared is 0.98, almost 1! Recall that the R-squared was 0.16 for district and 0.005 for area (again for price per meter). It seems the factor that most affects price/meter is time rather than location. (Note that this regression is fitted on the six yearly averages, so its R-squared is not directly comparable to the per-transaction ones, but the trend is unmistakable.)
The fact that prices grow over time reflects inflation, and analysing price over time can be very useful for economic purposes.
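To put the fitted slope in perspective: roughly 250 €/m² of growth per year against a city-wide mean of about 4,100 €/m² (approximated from the district averages above) corresponds to annual growth on the order of 6%. A back-of-the-envelope sketch:

```r
slope    <- 250.59  # EUR per m^2 per year, from the regression above
mean_ppm <- 4100    # rough city-wide mean price/m^2 (an approximation)

annual_growth <- slope / mean_ppm                 # about 6% per year
doubling_time <- log(2) / log(1 + annual_growth)  # years to double at this rate
```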
h <- df[, c('rooms_count', 'type_property')]
freq <- dcast(as.data.table(h), rooms_count ~ type_property)
## Using 'type_property' as value column. Use 'value.var' to override
## Aggregate function missing, defaulting to 'length'
freq <- freq[, -1]
dt <- as.table(as.matrix(freq))
balloonplot(t(dt), main = "rooms count vs property type", xlab = "type_property", ylab = "rooms_count", label = TRUE, show.margins = FALSE)
boxplot(surface_housing ~ type_property, xlab = 'type of property', data = df, col = 'purple', main = "Boxplot of surface by property type")
The boxplot shows a higher average surface for maisons. The room-count plot, however, is less clear-cut: most apartments have 3 rooms and most houses have 4. The difference is too small to draw a conclusion from the plots alone, so we test our hypothesis with a t-test.
house = df[df$"type_property" == "maison", c("surface_housing")]
appart = df[df$"type_property" == "appartement", c("surface_housing")]
room_app = df[df$"type_property" == "appartement", c("rooms_count")]
room_house = df[df$"type_property" == "maison", c("rooms_count")]
res = t.test(house,appart)
res2 = t.test(room_house,room_app)
print("t-test result for surface housing")
## [1] "t-test result for surface housing"
res
##
## Welch Two Sample t-test
##
## data: house and appart
## t = 25.493, df = 682.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 36.61301 42.72343
## sample estimates:
## mean of x mean of y
## 104.2288 64.5606
print("t-test result for rooms count")
## [1] "t-test result for rooms count"
res2
##
## Welch Two Sample t-test
##
## data: room_house and room_app
## t = 31.352, df = 695.15, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.327432 1.504795
## sample estimates:
## mean of x mean of y
## 4.184250 2.768136
Thus at the \(95\%\) confidence level the averages are not equal. In fact, since \(p_{value} < 0.01\), the two groups have different means even at the 99% confidence level, so maisons are generally larger than apartments (both in surface and in rooms count).
I would want a house with the properties below.
The first step is to fix the location. Since all price ranges and areas can be found in every district, I initially choose the district by closeness to the university. Let's draw the map of districts again, this time also marking the university's location.
#coordinates of university
unilong = 4.8655
unilat = 45.7802
ggplot() +geom_point( data=df, aes(y=latitude, x=longitude, color = as.factor(district)), size = 0.4)+geom_circle(aes(x0 = unilong, y0 = unilat, r = 0.005))
The university is in Villeurbanne, and its location is not on the border of the district or close to any other one, so it makes little sense to buy a house anywhere else. District 6 is also fairly close, but recall Part 2.1, where we drew the histograms of prices: Villeurbanne was one of the cheapest districts and district 6 the most expensive! Here it is again:
dists = unique(df[c('district')])
mean_dist_price = c()
for (i in (1:nrow(dists))){
word = as.character(dists[i,1])
filtered <- df[as.character(df$district) == dists[i,1],]
mean_dist_price <- c(mean_dist_price, mean(filtered$price_per_meter))
}
par(mar=c(11,4,4,4))
barplot(mean_dist_price,
main = "district mean prices",
ylab = "mean price",
names.arg = dists$district, cex.names = 1,las =2,
col = "hotpink")
Thus I choose Villeurbanne over district 6 because of the prices.
To narrow things down further, we filter only the houses in this district:
vil <- filter(df, district == "Villeurbanne")
#some coordinates are missing. I don't want those that don't have coordinates
vil <- na.omit(vil)
#reset index
rownames(vil) <- NULL
Then we design a formula to give each house a certain score based on properties we desire.
score_func <- function(i) {
room_num = vil$rooms_count[i]
if((room_num < 3) || (room_num > 5)){
# I don't want a house that has less than 3 or more than 5 rooms
return(0)
}
area = as.numeric(vil$surface_housing[i])
if(area < 50){
# I don't want a house that is smaller than 50 m^2
return(0)
}
price = as.numeric(vil$price[i])
mean_price = mean(vil$price)
scaled_price = (price - mean_price)/mean_price
d_station = as.numeric(vil$closest_distance[i])
d_uni = measure_distance(as.numeric(vil$latitude[i]),as.numeric(unilat),as.numeric(vil$longitude[i]),as.numeric(unilong) )
#the higher the price, the worse! it's the most important factor, hence coefficient 10
#the smaller the distances, the better! distance to the university matters more
score = (-10*scaled_price)+(-5*d_uni)+(-3*d_station)
return(score)
}
s = c()
for (i in (1 : nrow(vil))){
s = c(s , score_func(i))
}
vil['score'] <- s
vil = vil[vil$score != 0, ]
#sort
vil <- vil[order(-vil$score),]
head(vil,20)
| date_transaction | type_purchase | type_property | rooms_count | surface_housing | surface_terrain | parkings_count | price | address | district | latitude | longitude | date_construction | price_per_meter | closest_distance | closest_station | score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3743 | 0016-07-18 | ancien | appartement | 3 | 79 | 0 | 0 | 150000 | 1 RUE DU TONKIN | Villeurbanne | 45.77831 | 4.865586 | 0003-06-11 | 1898.734 | 105.2488 | Condorcet | -1367.261 |
| 3751 | 0020-07-09 | ancien | appartement | 4 | 94 | 0 | 3 | 218000 | 1 RUE DU TONKIN | Villeurbanne | 45.77831 | 4.865586 | 0003-06-11 | 2319.149 | 105.2488 | Condorcet | -1370.602 |
| 3745 | 0017-01-30 | ancien | appartement | 4 | 94 | 0 | 0 | 223550 | 1 RUE DU TONKIN | Villeurbanne | 45.77831 | 4.865586 | 0003-06-11 | 2378.191 | 105.2488 | Condorcet | -1370.874 |
| 3750 | 0020-02-07 | ancien | appartement | 4 | 94 | 0 | 3 | 240000 | 1 RUE DU TONKIN | Villeurbanne | 45.77831 | 4.865586 | 0003-06-11 | 2553.191 | 105.2488 | Condorcet | -1371.683 |
| 3753 | 0020-12-10 | ancien | appartement | 3 | 76 | 0 | 1 | 245000 | 1 RUE DU TONKIN | Villeurbanne | 45.77831 | 4.865586 | 0003-06-11 | 3223.684 | 105.2488 | Condorcet | -1371.928 |
| 3748 | 0019-09-30 | ancien | appartement | 4 | 94 | 0 | 0 | 299730 | 1 RUE DU TONKIN | Villeurbanne | 45.77831 | 4.865586 | 0003-06-11 | 3188.617 | 105.2488 | Condorcet | -1374.617 |
| 3854 | 0016-12-01 | ancien | appartement | 4 | 75 | 0 | 0 | 104643 | 4 RUE DU TONKIN | Villeurbanne | 45.77861 | 4.864574 | 0018-05-23 | 1395.240 | 153.1833 | Condorcet | -1409.876 |
| 3858 | 0020-07-30 | ancien | appartement | 4 | 75 | 0 | 0 | 124853 | 4 RUE DU TONKIN | Villeurbanne | 45.77861 | 4.864574 | 0018-05-23 | 1664.707 | 153.1833 | Condorcet | -1410.869 |
| 3852 | 0016-11-10 | ancien | appartement | 4 | 75 | 0 | 1 | 133600 | 50 BD DU ONZE NOVEMBRE 1918 | Villeurbanne | 45.77861 | 4.864574 | 0018-05-23 | 1781.333 | 153.1833 | Condorcet | -1411.299 |
| 3857 | 0019-11-21 | ancien | appartement | 3 | 67 | 0 | 2 | 135264 | 50 BD DU ONZE NOVEMBRE 1918 | Villeurbanne | 45.77861 | 4.864574 | 0018-05-23 | 2018.866 | 153.1833 | Condorcet | -1411.380 |
| 3853 | 0016-11-29 | ancien | appartement | 5 | 97 | 0 | 0 | 135675 | 4 RUE DU TONKIN | Villeurbanne | 45.77861 | 4.864574 | 0018-05-23 | 1398.711 | 153.1833 | Condorcet | -1411.401 |
| 3855 | 0017-03-06 | ancien | appartement | 4 | 87 | 0 | 1 | 141400 | 4 RUE DU TONKIN | Villeurbanne | 45.77861 | 4.864574 | 0018-05-23 | 1625.287 | 153.1833 | Condorcet | -1411.682 |
| 3718 | 0017-04-19 | ancien | appartement | 3 | 65 | 0 | 1 | 137500 | 40 BD DU ONZE NOVEMBRE 1918 | Villeurbanne | 45.77871 | 4.863584 | 0003-06-11 | 2115.385 | 225.5238 | Condorcet | -1788.533 |
| 3719 | 0017-08-01 | ancien | appartement | 3 | 65 | 0 | 1 | 156500 | 40 BD DU ONZE NOVEMBRE 1918 | Villeurbanne | 45.77871 | 4.863584 | 0003-06-11 | 2407.692 | 225.5238 | Condorcet | -1789.467 |
| 3717 | 0017-03-01 | ancien | appartement | 3 | 65 | 0 | 1 | 168000 | 36 BD DU ONZE NOVEMBRE 1918 | Villeurbanne | 45.77871 | 4.863584 | 0003-06-11 | 2584.615 | 225.5238 | Condorcet | -1790.032 |
| 3727 | 0019-01-14 | ancien | appartement | 4 | 74 | 0 | 1 | 195500 | 7 AV ROBERTO ROSSELLINI | Villeurbanne | 45.77871 | 4.863584 | 0003-06-11 | 2641.892 | 225.5238 | Condorcet | -1791.383 |
| 3730 | 0019-05-23 | ancien | appartement | 3 | 65 | 0 | 1 | 215000 | 40 BD DU ONZE NOVEMBRE 1918 | Villeurbanne | 45.77871 | 4.863584 | 0003-06-11 | 3307.692 | 225.5238 | Condorcet | -1792.341 |
| 3645 | 0017-10-10 | ancien | appartement | 3 | 68 | 0 | 1 | 145000 | 14 RUE DU TONKIN | Villeurbanne | 45.77795 | 4.864337 | 0003-06-11 | 2132.353 | 203.4334 | Condorcet | -1936.011 |
| 3643 | 0017-04-24 | ancien | appartement | 4 | 81 | 0 | 1 | 160000 | 18 RUE DU TONKIN | Villeurbanne | 45.77795 | 4.864337 | 0003-06-11 | 1975.309 | 203.4334 | Condorcet | -1936.748 |
| 3652 | 0018-03-19 | ancien | appartement | 4 | 81 | 0 | 1 | 172100 | 18 RUE DU TONKIN | Villeurbanne | 45.77795 | 4.864337 | 0003-06-11 | 2124.691 | 203.4334 | Condorcet | -1937.343 |
These are the houses with the highest scores, in descending order.
Below is the score map:
ggmap(map)+
geom_point(
data = vil,
mapping = aes(x = longitude, y = latitude, color = score, shape = type_property)
, size = 1.5
)+ scale_color_gradientn(colours=c("cyan","darkorchid1"))+scale_shape_manual(values = c(4,15))
ggplot() +geom_point( data=vil, aes(y=latitude, x=longitude, color = score), size = 1.5)+geom_circle(aes(x0 = unilong, y0 = unilat, r = 0.007))+ scale_color_gradientn(colours=c("cyan","darkorchid1"))+scale_shape_manual(values = c(4,15))
This is the area where I would want to buy a house, and the first rows of the vil data frame are the most suitable candidates.
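As a closing technical note: score_func() relies on measure_distance(), which was defined earlier in the document. For reference, a typical haversine great-circle distance (in metres, with arguments in the same lat1, lat2, lon1, lon2 order used above) might look like this; it is an assumed implementation, not necessarily the author's exact one:

```r
# Hypothetical stand-in for measure_distance(): haversine distance in metres
# between two points given as latitudes and longitudes in degrees.
haversine_m <- function(lat1, lat2, lon1, lon2) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * 6371000 * asin(sqrt(a))  # mean Earth radius ~6371 km
}

# e.g. a point in central Villeurbanne to the university coordinates
d <- haversine_m(45.77, 45.7802, 4.88, 4.8655)
```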